Assignment 1


Black Friday Sales Data Analysis

1. Introduction

Black Friday is one of the largest shopping events worldwide, where customers make bulk purchases across different product categories. Understanding customer behavior and product performance is crucial for businesses to optimize marketing strategies and improve sales.

2. Dataset Description

The dataset consists of transactions from a retail store on Black Friday. It includes information about customer demographics, product categories, and purchase amounts.

3. Column Descriptions

Column Name Description
User_ID Unique customer identifier
Product_ID Unique product identifier
Gender Customer gender (M/F)
Age Age group of the customer
Occupation Customer occupation category (masked)
City_Category City type (A = Metro, B = Tier 1, C = Tier 2)
Marital_Status Marital status (0 = Single, 1 = Married)
Product_Category_1 Primary product category
Purchase Purchase amount

Columns Removed: Product_Category_2, Product_Category_3, Stay_In_Current_City_Years (as they are either redundant or not useful for the analysis).

Workflow

flowchart TD;
    A[Start] --> B[Load Dataset]
    B --> C[Data Cleaning]
    C --> D[Statistical Analysis]
    D --> E[Exploratory Data Analysis]
    E --> F[Outlier Detection]
    F --> G[Linear Regression Model]
    G --> H[Model Evaluation]
    H --> I[Conclusion]

Loading Libraries

### Load Required Libraries
library(tidyverse)
library(kableExtra)
library(dplyr)
library(tibble)
library(performance)
library(ggplot2)
library(scales)

4. Data Import and Cleaning

1. Loading Dataset

The dataset was imported using the read_csv() function and stored in a dataframe.

# Read the dataset
black_friday <- read_csv("/Users/amitkumar/Desktop/MSc Business Analytics/Sem 2/applied analytics/r_applied_analytics/Assignment_1/dataset/Black_friday_sales_dataset.csv")
# Convert categorical variables to factors
black_friday <- black_friday %>%
  mutate(
    Gender = as.factor(Gender),
    Age = as.factor(Age),
    City_Category = as.factor(City_Category),
    Occupation = as.factor(Occupation),
    Marital_Status = as.factor(Marital_Status),
    Product_Category_1 = as.factor(Product_Category_1)
  )

2. Structure of Dataset

# Check the structure of the dataset
str(black_friday)
tibble [550,068 × 12] (S3: tbl_df/tbl/data.frame)
 $ User_ID                   : num [1:550068] 1e+06 1e+06 1e+06 1e+06 1e+06 ...
 $ Product_ID                : chr [1:550068] "P00069042" "P00248942" "P00087842" "P00085442" ...
 $ Gender                    : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 2 2 2 2 ...
 $ Age                       : Factor w/ 7 levels "0-17","18-25",..: 1 1 1 1 7 3 5 5 5 3 ...
 $ Occupation                : Factor w/ 21 levels "0","1","2","3",..: 11 11 11 11 17 16 8 8 8 21 ...
 $ City_Category             : Factor w/ 3 levels "A","B","C": 1 1 1 1 3 1 2 2 2 1 ...
 $ Stay_In_Current_City_Years: chr [1:550068] "2" "2" "2" "2" ...
 $ Marital_Status            : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 2 2 ...
 $ Product_Category_1        : Factor w/ 20 levels "1","2","3","4",..: 3 1 12 12 8 1 1 1 1 8 ...
 $ Product_Category_2        : num [1:550068] NA 6 NA 14 NA 2 8 15 16 NA ...
 $ Product_Category_3        : num [1:550068] NA 14 NA NA NA NA 17 NA NA NA ...
 $ Purchase                  : num [1:550068] 8370 15200 1422 1057 7969 ...

3. View the first few rows

# View the first few rows
head(black_friday)
# A tibble: 6 × 12
  User_ID Product_ID Gender Age   Occupation City_Category
    <dbl> <chr>      <fct>  <fct> <fct>      <fct>        
1 1000001 P00069042  F      0-17  10         A            
2 1000001 P00248942  F      0-17  10         A            
3 1000001 P00087842  F      0-17  10         A            
4 1000001 P00085442  F      0-17  10         A            
5 1000002 P00285442  M      55+   16         C            
6 1000003 P00193542  M      26-35 15         A            
# ℹ 6 more variables: Stay_In_Current_City_Years <chr>, Marital_Status <fct>,
#   Product_Category_1 <fct>, Product_Category_2 <dbl>,
#   Product_Category_3 <dbl>, Purchase <dbl>

Getting a random view of dataset for a better understanding

# Getting a random view of dataset for a better understanding
sample_n(black_friday,10)
# A tibble: 10 × 12
   User_ID Product_ID Gender Age   Occupation City_Category
     <dbl> <chr>      <fct>  <fct> <fct>      <fct>        
 1 1003945 P00050342  F      46-50 1          B            
 2 1002309 P00177142  F      46-50 0          C            
 3 1002552 P00110842  F      26-35 9          C            
 4 1002931 P00139942  M      18-25 4          B            
 5 1001722 P00180342  M      36-45 0          C            
 6 1005892 P00119242  M      46-50 2          C            
 7 1001470 P00246442  M      18-25 4          A            
 8 1004859 P00282642  F      18-25 14         C            
 9 1002482 P00298742  F      51-55 3          C            
10 1002820 P00252442  F      36-45 0          A            
# ℹ 6 more variables: Stay_In_Current_City_Years <chr>, Marital_Status <fct>,
#   Product_Category_1 <fct>, Product_Category_2 <dbl>,
#   Product_Category_3 <dbl>, Purchase <dbl>

4. Removing Unnecessary Columns

Columns such as Stay_In_Current_City_Years, Product_Category_2 and Product_Category_3 were removed as they do not contribute to the analysis.

# Remove unnecessary columns
black_friday <- black_friday %>% select(-c(Product_Category_2, Product_Category_3, Stay_In_Current_City_Years))

5. Handling Missing Values

# Check missing values
missing_values <- colSums(is.na(black_friday))
print(missing_values)
           User_ID         Product_ID             Gender                Age 
                 0                  0                  0                  0 
        Occupation      City_Category     Marital_Status Product_Category_1 
                 0                  0                  0                  0 
          Purchase 
                 0 

5. Statistical Analysis

# used chatgpt for code
### 5. Statistical Metrics, as there is only one numeric attribute i.e. Purchase
stats_summary <- tibble(
  Statistic = c("Mean", "Median", "Std Dev", "Min", "Max"),
  Purchase = c(mean(black_friday$Purchase), median(black_friday$Purchase),
               sd(black_friday$Purchase), min(black_friday$Purchase), max(black_friday$Purchase))
)

print(stats_summary)
# A tibble: 5 × 2
  Statistic Purchase
  <chr>        <dbl>
1 Mean         9264.
2 Median       8047 
3 Std Dev      5023.
4 Min            12 
5 Max         23961 

6. Exploratory Data Analysis

6.1 The percentage of spending by gender

# Total spending by gender
gender_spending <- black_friday %>%
  group_by(Gender) %>%
  summarise(Total_Spending = sum(Purchase, na.rm = TRUE))

# Calculate the percentage of total spending
gender_spending <- gender_spending %>%
  mutate(Percentage_Spending = (Total_Spending / sum(Total_Spending)) * 100)


# used chatgpt for code
# Plot the percentage of spending by gender
ggplot(gender_spending, aes(x = Gender, y = Percentage_Spending, fill = Gender)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  scale_fill_brewer(palette = "Pastel1") +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +  # Format as percentage
  labs(title = "Percentage of Total Spending by Gender",
       x = "Gender",
       y = "Percentage of Total Spending") +
  theme_minimal()

Males contributed significantly more with 78% to total sales than females with 22%.

6.2 Spending by Age Group

# used chatgpt for code
# Total spending by each age group
age_spending <- black_friday %>%
  group_by(Age) %>%
  summarise(Total_Spending = sum(Purchase))

# Calculate the percentage of total spending
age_spending <- age_spending %>%
  mutate(Percentage_Spending = (Total_Spending / sum(Total_Spending)) * 100)

# Plot the percentage of spending by age group
ggplot(age_spending, aes(x = Age, y = Percentage_Spending, fill = Age)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  scale_fill_brewer(palette = "Pastel2") +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +  # Format as percentage
  labs(title = "Percentage of Total Spending by Age Group",
       x = "Age Group",
       y = "Percentage of Total Spending") +
  theme_minimal()

  • Customers in the 26-35 age group had the highest total spending.

  • Older age groups showed lower spending trends.

6.3 Top 5 Product Categories by Revenue

# Top 5 product categories by revenue
product_revenue <- black_friday %>%
  group_by(Product_Category_1) %>%
  summarise(Total_Revenue = sum(Purchase)) %>%
  arrange(desc(Total_Revenue)) %>%
  head(5)

# used chatgpt for code
ggplot(product_revenue, aes(x = reorder(Product_Category_1, Total_Revenue), y = Total_Revenue, fill = Product_Category_1)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  scale_fill_brewer(palette = "Pastel2") +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Top 5 Product Categories by Revenue", x = "Product Category", y = "Total Revenue") +
  coord_flip() +
  theme_minimal()

7. Outlier Detection & Removal

# Check for outliers using boxplot
boxplot(black_friday$Purchase, main = "Boxplot of Purchase Amounts", col = "lightblue")

# Remove outliers using IQR method
Q1 <- quantile(black_friday$Purchase, 0.25)
Q3 <- quantile(black_friday$Purchase, 0.75)
IQR_value <- Q3 - Q1

# Define upper and lower limits
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

# Filter out outliers (Regular Transactions)
black_friday_clean <- black_friday %>%
  filter(Purchase >= lower_bound & Purchase <= upper_bound)

# Identify high-value transactions (Potential Premium Purchases)
extreme_high_values <- black_friday %>%
  filter(Purchase > upper_bound)
  • Outliers were detected using the IQR method.

  • Transactions exceeding the upper bound were considered high-value transactions.

  • A boxplot was used to visualize these high-value purchases.

# Show how many rows were removed
paste("Rows before cleaning:", nrow(black_friday))
[1] "Rows before cleaning: 550068"
paste("Rows after cleaning:", nrow(black_friday_clean))
[1] "Rows after cleaning: 547391"
paste("Rows removed:", nrow(black_friday) - nrow(black_friday_clean))
[1] "Rows removed: 2677"
# Display top high-value transactions
# Potential premium purchases
head(extreme_high_values)
# A tibble: 6 × 9
  User_ID Product_ID Gender Age   Occupation City_Category Marital_Status
    <dbl> <chr>      <fct>  <fct> <fct>      <fct>         <fct>         
1 1000058 P00117642  M      26-35 2          B             0             
2 1000062 P00119342  F      36-45 3          A             0             
3 1000126 P00087042  M      18-25 9          B             0             
4 1000139 P00159542  F      26-35 20         C             0             
5 1000175 P00052842  F      26-35 2          B             0             
6 1000235 P00116142  M      26-35 0          B             0             
# ℹ 2 more variables: Product_Category_1 <fct>, Purchase <dbl>

7.1 Premium Purchases by Age Group

# ---- Percentage of High-Value Transactions by Age ---- #

# used chatgpt for code
# Count total transactions per age group
total_purchases_by_age <- black_friday %>%
  group_by(Age) %>%
  summarise(Total_Transactions = n())

# Count high-value transactions per age group
high_value_purchases_by_age <- extreme_high_values %>%
  group_by(Age) %>%
  summarise(High_Value_Transactions = n())

# Merge datasets to calculate percentage
age_high_value_percentage <- merge(total_purchases_by_age, high_value_purchases_by_age, by = "Age", all.x = TRUE)

# Replace NA (for groups with zero high-value transactions) with 0
age_high_value_percentage$High_Value_Transactions[is.na(age_high_value_percentage$High_Value_Transactions)] <- 0

# Calculate percentage of high-value transactions
age_high_value_percentage <- age_high_value_percentage %>%
  mutate(Percentage_High_Value = (High_Value_Transactions / Total_Transactions) * 100)

# ---- Visualizing High-Value Purchase Percentage by Age ---- #

ggplot(age_high_value_percentage, aes(x = Age, y = Percentage_High_Value, fill = Age)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  scale_fill_brewer(palette = "Pastel2") +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +  # Convert to percentage format
  labs(title = "Percentage of High-Value Transactions by Age Group",
       x = "Age Group",
       y = "Percentage of Premium Purchases") +
  theme_minimal()

  • The percentage of high-value purchases was calculated for each age group.

8. Predictive Modeling: Linear Regression

1. Selecting Features

  • Independent variables: Gender, Age, and City_Category.

  • The categorical variables were converted into numeric form.

# used chatgpt for code
# Convert categorical variables to numeric factors
black_friday <- black_friday %>%
  mutate(
    Gender = as.numeric(as.factor(Gender)),  # Male = 1, Female = 2
    Age = as.numeric(as.factor(Age)),  # Convert Age groups to numbers
    City_Category = as.numeric(as.factor(City_Category))  # Convert City A/B/C to numbers
  )
# Build linear regression model
linear_model <- lm(Purchase ~ Gender + Age + City_Category, data = black_friday)

# Display model summary
summary(linear_model)

Call:
lm(formula = Purchase ~ Gender + Age + City_Category, data = black_friday)

Residuals:
   Min     1Q Median     3Q    Max 
 -9925  -3525  -1127   2923  15653 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7088.902     36.997 191.609  < 2e-16 ***
Gender         706.601     15.647  45.159  < 2e-16 ***
Age             31.812      5.023   6.334 2.39e-10 ***
City_Category  403.933      8.943  45.166  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5004 on 550064 degrees of freedom
Multiple R-squared:  0.007581,  Adjusted R-squared:  0.007576 
F-statistic:  1401 on 3 and 550064 DF,  p-value: < 2.2e-16
  • The model tries to predict purchase amount based on Gender, Age, and City Category.

  • It is statistically significant but explains less than 1% of the variation in purchases, meaning it does not predict well.

9. Conclusion

This analysis of Black Friday Sales data helped us understand customer spending habits. We also identified high-value purchases using outlier detection and built a Linear Regression model to predict spending. These insights can help businesses target the right customers, improve marketing, and manage inventory better. Future improvements could include using advanced machine learning models to make better predictions.